## BEFORE WE START, READ IN REQUIRED LIBRARIES:
require(readxl)
library(tidyverse)
require(sf)
require(tigris)
require(plotly)
require(data.table)
require(purrr)
require(gganimate)
library(ggridges)
library(leaflet)
library(cowplot)
library(patchwork)
library(slickR)
library(glue)
library(magick)
# require(devtools)
# devtools::install_github("cuhklinlab/SWKM")
library(SWKM)
options(scipen = 999)
\[\space\]
All insurance companies that provide a private passenger auto product find it very helpful to include rating by ZIP code in their pricing. I want to lay out an example of how this is done. Luckily, we have resources from California’s Department of Insurance (DOI) available publicly online. These resources show losses, exposures, claim counts, and other relevant information by ZIP code for each coverage type provided to the insureds of California. I will get more into the definitions of these terms soon.
Here I perform an empirical pricing analysis to determine ZIP code (“territory”) pricing factors. These factors multiply the premium that is otherwise determined through the other rating variables; us actuaries call these “factors”. The analysis is empirical because we use observed losses to determine the pricing “factor” directly (rather than having a GLM spit out factors corresponding to the rating variables you fit). Ideally (absent expenses and such), insurance prices would be identical to the expected “loss cost”, or “pure premium”. Thus, this exercise is empirical because we price the multiplicative insurance multiplier (the one we would apply to the pure premium) using observed loss costs themselves.
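To make the idea concrete before we touch the DOI data, here is the exercise in miniature. All numbers (and the two territories “A” and “B”) are invented for illustration:

```r
## Toy empirical territory factors (all numbers invented for illustration)
losses    <- c(A = 500000, B = 120000)  # observed losses by territory
exposures <- c(A = 1000,   B = 400)     # exposure (car-years) by territory

loss_cost <- losses / exposures            # "pure premium" per car-year
overall   <- sum(losses) / sum(exposures)  # overall loss cost

## the empirical factor is each territory's loss cost relative to the overall
territory_factor <- loss_cost / overall
round(territory_factor, 3)                 # A ~ 1.129, B ~ 0.677
```

Territory A would be priced about 13% above the average pure premium and B about 32% below it; the real analysis below does the same thing per coverage, with credibility weighting layered on top.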
Here is some background information, to clear up some jargon:
- BI: (liability) Bodily Injury, for those harmed in an accident who aren’t the insured
- PD: (liability) Property Damage, for property/vehicles damaged by the insured in an accident (but not the insured’s vehicle)
- COLL: Collision, for the vehicle damage suffered by the insured vehicle on the policy
- COMP: Comprehensive, for non-collision-related damage to the insured’s vehicle
- MP: Medical Payments, for injuries sustained in an accident by the insured himself
- UMPD: Uninsured Motorist Property Damage, for the vehicle of the insured in accidents where the at-fault driver is an uninsured motorist
- UMBI: Uninsured Motorist Bodily Injury, for the bodily injury related expenses of the insured in accidents where the at-fault driver is an uninsured motorist

\[\space\]
We read in data from here, and consult its corresponding documentation here (found from the link shown here).
From the link above, we read in the xlsx file that has our data; each tab contains our data points for one coverage. Read each in separately first.
## READ IN DATA
path <- "C:/Users/15708/Downloads/2008-Frequency-and-Severity-Bands-Manual-Second-Edition-Data.xlsx"
sheetnames <- readxl::excel_sheets(path)
for(i in sheetnames){
if(grepl("band", tolower(i))){
## I don't want these
next
}
assign(gsub("-|\\s", "_", i),
readxl::read_excel(path,
sheet = i)
)
}
coverages <- c("BI", "PD", "COLL", "COMP", "MP", "UMPD", "UMBI")
We will begin our pricing exercise by constructing credibility factors for loss cost. We do this by backing out of the severity credibility factors already in the data to get a coefficient of variation measure for each coverage. We use this and the raw data to create our credibility factors for loss cost. Then we create loss cost relativities for each ZIP code and each coverage, and apply the credibility factors to get credibility-weighted rating factors.
## REVERSE ENGINEER CREDIBILITY TO GET COEFFICIENT OF VARIATION AND DEFINE OUR OWN CREDIBILITY (using, now, Loss Cost instead of Frequency and Severity)
## from the documentation, we know the credibility that they used:
# alpha = .05
k = .1
p = .95
# giving:
# Z = abs(qnorm(p = (1-p)/2)) = 1.96
Z = 1.96
## implied standard for full credibility for frequency as described in the pdf
(cred_std_freq = round((Z/k)^2))
## [1] 384
## test:
overall_loss_cost = sum(BI_SEVERITY$Losses) / sum(BI_SEVERITY$`Exposure Years`)
BI_SEVERITY %>%
mutate(coef_var_2 = (Credibility^2 / `Claim Count`*cred_std_freq)^-1) %>%
mutate(loss_cost_credibility = pmin( sqrt(`Claim Count`/(1+coef_var_2)/cred_std_freq) ,1),
loss_cost = Losses / `Exposure Years`,
credibility_weighted_loss_cost = loss_cost*(loss_cost_credibility) + (1-loss_cost_credibility)*overall_loss_cost
) %>%
mutate(rebased_credibility_weighted_loss_cost =
credibility_weighted_loss_cost /
max(ifelse( `Exposure Years` == max(`Exposure Years`), credibility_weighted_loss_cost, 0))
) %>%
head() %>%
print()
## # A tibble: 6 × 14
## Zipcode `Exposure Years` `Claim Count` Losses Severity Credibility
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 90001 92546 1260 10585915 8402. 1
## 2 90002 80854 1270 10946533 8619. 1
## 3 90003 107837 1791 15760116 8800. 1
## 4 90004 113526 2343 20579387 8783. 1
## 5 90005 53861 1285 11537249 8978. 1
## 6 90006 80815 1849 16095426 8705. 1
## # ℹ 8 more variables: `Severity Complement` <dbl>,
## # `Credibility-Weighted Severity` <dbl>, `Severity Band` <dbl>,
## # coef_var_2 <dbl>, loss_cost_credibility <dbl>, loss_cost <dbl>,
## # credibility_weighted_loss_cost <dbl>,
## # rebased_credibility_weighted_loss_cost <dbl>
print("^ Example of one of our tables: BI")
## [1] "^ Example of one of our tables: BI"
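As a sanity check on where the 384 standard above comes from, the classical limited-fluctuation formula can be wrapped in a small helper (the function name is mine, not from the DOI documentation): we require the observed frequency to land within ±k of its mean with probability p, under a normal approximation.

```r
## full-credibility claim standard under the limited-fluctuation model
full_cred_standard <- function(p, k) {
  z <- qnorm(1 - (1 - p) / 2)  # two-sided standard normal quantile
  (z / k)^2
}

round(full_cred_standard(p = 0.95, k = 0.10))  # 384, matching the pdf
round(full_cred_standard(p = 0.90, k = 0.05))  # 1082, a common alternative choice
```

Tightening k or raising p inflates the claim standard quickly, which is why the choice of (p, k) is itself an actuarial judgement.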
## DEFINE CREDIBILITY WEIGHTED FACTORS:
## from pdf:
severity_standards <- c("BI_SEVERITY" = 420, "COLL_SEVERITY" = 661, "COMP_SEVERITY" = 1487, "MP_SEVERITY" = 165, "PD_SEVERITY" = 214, "UMBI_SEVERITY" = 407, "UMPD_SEVERITY" = 185)
## implied coefficients of variation (squared)
severity_coefficients_of_variation_2 <- severity_standards / cred_std_freq
## loop over each coverage, right now each coverage has their own data frame
for(i in grep(pattern = "SEVERITY", ls(), value = TRUE) ){
## get coverage data
temp <- get(i)
## define overall loss cost for coverage
overall_loss_cost = sum(temp$Losses) / sum(temp$`Exposure Years`)
## there was a problem with the source data, fix here
temp <- tryCatch({
## try to rename this column if its spelled wrong, if not do nothing
temp %>% rename(Credibility = Credibilty)
}, error = function(e) temp)
## get loss cost (or, pure premium), its partial credibility factor, and credibility weighted loss cost factor
temp <-
temp %>%
## back out of the severity credibility factors they used
mutate(coef_var_2 = !!severity_coefficients_of_variation_2[i]) %>%
## define our credibility weighted factors, using credibility for pure premium/aggregate loss/loss cost
mutate(loss_cost_partial_credibility_UNCAPPED = sqrt(`Claim Count`/(1+coef_var_2)/!!cred_std_freq),
loss_cost_partial_credibility = pmin( loss_cost_partial_credibility_UNCAPPED ,1),
loss_cost = Losses / `Exposure Years`,
credibility_weighted_loss_cost =
loss_cost*(loss_cost_partial_credibility) +
overall_loss_cost * (1-loss_cost_partial_credibility)
) %>%
## only move forward with these columns
select(Zipcode, `Exposure Years`, `Claim Count`, Losses, credibility_weighted_loss_cost, loss_cost_partial_credibility_UNCAPPED)
## assign output data to new object
( assign_name <- gsub("_.*", "", i) )
assign(
assign_name,
temp)
rm(temp) ; rm(assign_name)
}
## using same looping structure, we want to put all coverages in the same dataframe
factors <-
lapply(X = grep(pattern = "SEVERITY", ls(), value = TRUE),
FUN = function(x){
## get variables that we created in previous loop
x <- gsub("_.*", "", x)
## get data frame
temp <- get(x)
temp <-
temp %>%
## format column names
rename_with(.cols = colnames(.)[-1],
.fn = ~paste0(x, "_", gsub(" ", "_", .))) %>%
## go with better name for pricing factor
rename_with(.cols = contains("credibility_weighted_loss_cost"),
.fn = ~paste0(x, "_pricing_factor"))
return(temp)
}
) %>%
Reduce(f = function(x, y) left_join(x, y, by = "Zipcode"))
factors %>%
head() %>%
print()
## # A tibble: 6 × 36
## Zipcode BI_Exposure_Years BI_Claim_Count BI_Losses BI_pricing_factor
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 90001 92546 1260 10585915 114.
## 2 90002 80854 1270 10946533 135.
## 3 90003 107837 1791 15760116 146.
## 4 90004 113526 2343 20579387 181.
## 5 90005 53861 1285 11537249 214.
## 6 90006 80815 1849 16095426 199.
## # ℹ 31 more variables: BI_loss_cost_partial_credibility_UNCAPPED <dbl>,
## # COLL_Exposure_Years <dbl>, COLL_Claim_Count <dbl>, COLL_Losses <dbl>,
## # COLL_pricing_factor <dbl>,
## # COLL_loss_cost_partial_credibility_UNCAPPED <dbl>,
## # COMP_Exposure_Years <dbl>, COMP_Claim_Count <dbl>, COMP_Losses <dbl>,
## # COMP_pricing_factor <dbl>,
## # COMP_loss_cost_partial_credibility_UNCAPPED <dbl>, …
You’ll notice that a lot of our factors are huge, with pricing multipliers that are orders of magnitude above 1. This is fine, because actuaries will set their rates such that, once we multiply all of the possible pricing factors together (for example, ZIP code, vehicle age, gender), we multiply the resulting coverage premium by a final “base rate”, which is the same for all policies and is determined to hit some targeted average premium that we want to charge our customers. But this is still a little messy, so…
Finally we “rebase” these factors. Because we will be multiplying our prices by these factors, it’s sensible for them to have a mean value roughly around 1. We will find the ZIP code that has the most exposures and set that as our “base level”: the ZIP code whose factor we set to 1, by dividing all factors by that ZIP code’s factor. It’s sensible to have your most populous ZIP code as the unity level; this makes other pricing actions that have to happen later on (such as determining the average price we want to target for our book of business) a much easier and clearer exercise.
In particular, we will use the sum of total BI and PD exposures (in each ZIP code) to set our base level, since all policies must carry these coverages. The data is a little inconsistent: total exposures for BI and PD should be identical, but they’re not, so we use the sum of BI and PD exposures to set our base level instead.
Find ZIP code that should be our base level, and rebase factors:
## REBASE FACTORS
#### TODO: INCLUDE?
# ## get rid of claim counts, and exposures for all coverages except for BI and PD
# factors <-
# factors %>%
# # select(-(contains("Losses") & !(starts_with("BI") | starts_with("PD")))) %>%
# select(-(contains("Claim") & !(starts_with("BI") | starts_with("PD")))) %>%
# select(-(contains("Exposure_Years") & !(starts_with("BI") | starts_with("PD"))))
factors <-
factors %>%
### perform our actuarial "rebasing"
## set base level as ZIP code with most BI and PD exposures
mutate(BI_PD_exposures = BI_Exposure_Years + PD_Exposure_Years) %>%
## largest exposure ZIP: 90650 Norwalk's ZIP code, a suburb of Los Angeles
mutate(across(.cols = ends_with("pricing_factor"),
## wrap in `max` to get value for entire column
.fns = ~. / max(ifelse(BI_PD_exposures == max(BI_PD_exposures), ., -1))
)) %>%
### Process data
## some factors are NaN, because they had no information. Set these as 1.0's.
mutate(across(.cols = ends_with("pricing_factor"),
.fns = ~ifelse(is.na(.), 1.0, .)
)) %>%
## these come from where loss amounts are 0. We want to pay attention to these later on to identify buckets where there is no exposure for our clustering. Set NA's to 0 as well.
mutate(across(.cols = contains("Losses") | contains("Exposure") | contains("credibility"),
.fns = ~ifelse(is.na(.), 0, .)
)) %>%
## dont need this column anymore
select(-BI_PD_exposures)
## here are our rebased pricing factors
factors %>%
select(Zipcode, contains("pricing_factor")) %>%
mutate(across(.cols = everything(),
.fns = ~round(., 3))) %>%
head()
## # A tibble: 6 × 8
## Zipcode BI_pricing_factor COLL_pricing_factor COMP_pricing_factor
## <dbl> <dbl> <dbl> <dbl>
## 1 90001 1.24 1.28 1.60
## 2 90002 1.46 1.32 1.65
## 3 90003 1.58 1.37 1.60
## 4 90004 1.96 1.60 0.991
## 5 90005 2.32 1.87 0.95
## 6 90006 2.16 1.62 0.886
## # ℹ 4 more variables: MP_pricing_factor <dbl>, PD_pricing_factor <dbl>,
## # UMBI_pricing_factor <dbl>, UMPD_pricing_factor <dbl>
Now we are done finding our pricing factors. Let’s take a look at them.
Here, we bring in ZIP code data so that we can make fancy plots of our factors. California ZIP codes were updated in 2008, and our data from the CA DOI is from 2008 as well; the next time they were updated was 2011. Requesting year 2008 errors out in the ZIP code pull here, so we use 2010.
Comparing against the ZIP codes we bring in, a few ZIP codes are missing from our factors. For the sake of our analysis as an exercise, we disregard this and move forward; but if I were to do this rigorously, I would assign those missing ZIP codes a factor of 1 (this is somewhat of a standard in the actuarial profession).
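For completeness, that “default missing ZIPs to 1” convention would look something like the sketch below. The tibbles and the ZIP 99999 are hypothetical, standing in for an unmatched code:

```r
## Hypothetical sketch: default unmatched ZIP codes to a factor of 1.0
library(dplyr)

all_zips   <- tibble::tibble(Zipcode = c(90001, 90002, 99999))  # e.g. from tigris
factor_tbl <- tibble::tibble(Zipcode = c(90001, 90002),
                             BI_pricing_factor = c(1.24, 1.46))

defaulted <-
  all_zips %>%
  left_join(factor_tbl, by = "Zipcode") %>%
  ## unmatched ZIPs come through as NA; replace with the neutral factor 1.0
  mutate(BI_pricing_factor = coalesce(BI_pricing_factor, 1.0))

defaulted
```

A factor of 1 is neutral: the policy is charged exactly the base-level price, which is the least presumptive treatment when we have no ZIP-level information.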
## GET FACTOR QUANTILES
lapply(X = factors %>% select(contains("pricing_factor")),
FUN = function(x) quantile(x, seq(0,1,.1)))
## $BI_pricing_factor
## 0% 10% 20% 30% 40% 50% 60% 70%
## 0.4700292 0.6645065 0.7270029 0.7683529 0.7997367 0.8234662 0.8427087 0.8661307
## 80% 90% 100%
## 0.9424607 1.1161033 2.6960589
##
## $COLL_pricing_factor
## 0% 10% 20% 30% 40% 50% 60% 70%
## 0.5460727 0.7867820 0.8389797 0.8789619 0.9114947 0.9381073 0.9645061 0.9991634
## 80% 90% 100%
## 1.0568920 1.1897153 2.8121856
##
## $COMP_pricing_factor
## 0% 10% 20% 30% 40% 50% 60% 70%
## 0.3805556 0.6795711 0.8141864 0.8958753 0.9462889 0.9882475 1.0374196 1.1074968
## 80% 90% 100%
## 1.1978270 1.3731559 6.0943758
##
## $MP_pricing_factor
## 0% 10% 20% 30% 40% 50% 60% 70%
## 0.3950909 0.5701002 0.6385015 0.6841512 0.7169781 0.7372946 0.7580579 0.7866343
## 80% 90% 100%
## 0.8750296 1.0936414 8.0185869
##
## $PD_pricing_factor
## 0% 10% 20% 30% 40% 50% 60% 70%
## 0.4497701 0.7151960 0.7872938 0.8457665 0.8821812 0.9104185 0.9443592 0.9860746
## 80% 90% 100%
## 1.0393800 1.1475321 2.6648002
##
## $UMBI_pricing_factor
## 0% 10% 20% 30% 40% 50% 60% 70%
## 0.5057186 0.6114698 0.6449859 0.6696101 0.6839987 0.6945086 0.7131241 0.7212141
## 80% 90% 100%
## 0.7813482 0.9373497 3.5687426
##
## $UMPD_pricing_factor
## 0% 10% 20% 30% 40% 50% 60% 70%
## 0.6142640 0.6994738 0.7214615 0.7393025 0.7511482 0.7653743 0.7800453 0.7827640
## 80% 90% 100%
## 0.8294023 0.9362045 2.1217486
## PLOT FACTOR DISTRIBUTIONS
suppressWarnings(
factors %>%
select(contains("pricing_factor")) %>%
pivot_longer(cols = everything(),
names_to = "coverage",
values_to = "pricing_factor") %>%
ggplot(aes(x = pricing_factor)) +
geom_histogram(bins = 100) + facet_wrap(~coverage) +
## impose a limit to make the plots prettier
xlim(c(0,2.5))
)
## Warning: Removed 45 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 14 rows containing missing values or values outside the scale range
## (`geom_bar()`).
### READ IN ZIP CODES, MAKE PLOTTING DATA
ca_zips <- suppressMessages(
tigris::zctas(year = 2010, cb = TRUE, starts_with = c("90", "91", "92", "93", "94", "95", "96"),
progress_bar = FALSE)
)
ca_state <- tigris::states(cb = TRUE, year = 2010, progress_bar = FALSE) %>%
filter(NAME == "California")
ca_zips <-
ca_zips %>%
mutate(ZCTA5 = as.numeric(ZCTA5)) %>%
filter(ZCTA5 > 90000 & ZCTA5 <= 96162) ## range of CA zips
ca_zips$ZCTA5 %>% unique %>% length()
## [1] 1763
factors$Zipcode %>% unique %>% length()
## [1] 1819
factors$BI_pricing_factor %>% quantile(seq(0,1,.1))
## 0% 10% 20% 30% 40% 50% 60% 70%
## 0.4700292 0.6645065 0.7270029 0.7683529 0.7997367 0.8234662 0.8427087 0.8661307
## 80% 90% 100%
## 0.9424607 1.1161033 2.6960589
factors$BI_pricing_factor %>% hist()
log(factors$BI_pricing_factor) %>% hist()
factors_plot <-
ca_zips %>%
inner_join(
factors,
by = c("ZCTA5" = "Zipcode")
)
# factors_plot %>% head()
View full map of plots
## FULL CALIFORNIA MAP
coverages <- factors_plot %>% as.data.frame() %>% select(contains("pricing_factor")) %>% colnames()
plots <- list()
img_paths <- c()
if(!dir.exists("plots/")) dir.create("plots/")
## Warning in dir.create("plots/"): 'plots' already exists
for(i in coverages){
p_full <- ggplot() +
geom_sf(data = factors_plot, aes(fill = !!sym(i)), color = "black", linewidth = 0.05) +
scale_fill_gradient2(low = "blue", mid = "gray90", high = "red",
midpoint = 1,
limits = range(factors_plot[[i]]),
na.value = "white") +
theme_void() +
theme(legend.position = "none") +
ggtitle(i)
# LA zoom map
p_LA <- ggplot() +
geom_sf(data = factors_plot, aes(fill = !!sym(i)), color = "black", linewidth = 0.05) +
scale_fill_gradient2(low = "blue", mid = "gray90", high = "red",
midpoint = 1,
limits = range(factors_plot[[i]]),
na.value = "white") +
coord_sf(xlim = c(-119, -117.5), ylim = c(33.5, 34.25)) +
theme_void() +
ggtitle("Los Angeles")
# SF zoom map
p_SF <- ggplot() +
geom_sf(data = factors_plot, aes(fill = !!sym(i)), color = "black", linewidth = 0.05) +
scale_fill_gradient2(low = "blue", mid = "gray90", high = "red",
midpoint = 1,
limits = range(factors_plot[[i]]),
na.value = "white", guide = NULL) +
coord_sf(xlim = c(-122.75, -121.65), ylim = c(37.3, 38.3)) +
theme_void() +
ggtitle("San Fran")
## plot together
p <-
ggdraw() +
draw_plot(p_full) + # Base map
draw_plot(p_LA, x = 0.55, y = 0.55, width = 0.4, height = 0.4) +
draw_plot(p_SF, x = 0.65, y = 0.3, width = 0.25, height = 0.25)
rm(p_full) ; rm(p_LA) ; rm(p_SF)
img_file <- paste0("plots/", i, ".png")
ggsave(img_file, plot = p, width = 8, height = 6, dpi = 150)
img_paths <- c(img_paths, img_file)
} ; rm(coverages)
slickR(img_paths)
\[\space\] \[\space\] \[\space\] \[\space\]
We see that primarily LA, and also some ZIP codes in SF, account for most of the factors above 1. This seems sensible: most accidents (and therefore claims) happen to people living in densely populated areas with lots of traffic.
At this point in our analysis, we would be done and have our final pricing factors for each ZIP code.
\[\space\] \[\space\] \[\space\]
So instead of a table that looks like this:
factors %>% select(Zipcode, contains("pricing_factor"))
## # A tibble: 1,819 × 8
## Zipcode BI_pricing_factor COLL_pricing_factor COMP_pricing_factor
## <dbl> <dbl> <dbl> <dbl>
## 1 90001 1.24 1.28 1.60
## 2 90002 1.47 1.32 1.65
## 3 90003 1.58 1.37 1.60
## 4 90004 1.96 1.60 0.991
## 5 90005 2.32 1.87 0.950
## 6 90006 2.16 1.62 0.886
## 7 90007 1.75 1.56 0.959
## 8 90008 1.61 1.42 1.02
## 9 90010 1.66 1.83 1.00
## 10 90011 1.44 1.29 1.43
## # ℹ 1,809 more rows
## # ℹ 4 more variables: MP_pricing_factor <dbl>, PD_pricing_factor <dbl>,
## # UMBI_pricing_factor <dbl>, UMPD_pricing_factor <dbl>
with 1819 rows, we instead need one that has 150 rows (where the Zipcode column would become a group id covering some group of these ZIP codes).
Here, we can consider our ZIP code factors to be the “perfect” price, and we have to find a way to assign ZIP codes to 150 groupings (territories) such that the factor assigned for each coverage in a territory results in the least amount of error; we want to retain as much of the information in the original factors from the end of section A as possible. This is exactly a dimensionality-reduction problem, and k-means clustering is the perfect tool for it.
I will assume the reader has a decent understanding of the traditional k-means clustering algorithm.
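Schematically, the output we are after is exactly what a k-means fit produces: the cluster id becomes the territory, and the centroid becomes that territory’s vector of factors. Here is a toy, unweighted sketch of that shape (the dimensions and coverage names are mine; the real fit below uses a weighted variant):

```r
## Toy sketch: 30 "ZIP codes" x 2 "coverages" collapsed into 5 territories
set.seed(7)
toy_factors <- matrix(rnorm(60), ncol = 2,
                      dimnames = list(NULL, c("BI", "PD")))

fit <- kmeans(toy_factors, centers = 5, nstart = 20)

## mapping table: each ZIP row gets a territory (group) id...
territory_map <- data.frame(zip_row   = seq_len(nrow(toy_factors)),
                            territory = fit$cluster)
## ...and each territory gets one factor per coverage (the centroid)
territory_factors <- fit$centers

nrow(territory_map)      # 30 rows, one per ZIP
nrow(territory_factors)  # 5 rows, one per territory
```

The 1819-row factor table plus a 150-row centroid table is the 150-territory answer in exactly this shape.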
\[\space\]
In our case, we can see why that may not be ideal. (1) Some ZIP codes have a much higher population and therefore much more exposure; we would want these to have more of a say in the factor that gets assigned as their cluster’s final price. (2) Some coverages are more important than others (since, for example, BI and PD are always chosen, but UMBI may be rarely chosen), so we would want these to be more determinant of how ZIP codes get assigned to clusters.
We solve these problems with some tweaks to our algorithm. To solve problem (1), we use a “weighted” k-means clustering algorithm instead of standard k-means; this uses a weight assigned to each ZIP code to scale each observation’s influence both on assigning observations to clusters and on computing the cluster centroid (which will be our new pricing factor). To solve problem (2), I devised a clever and novel hack on the k-means clustering algorithm (which assumes input columns’ means and variances are scaled to 0 and 1, respectively) that rescales the variances based on how much importance we want to give each column (coverage). I will give my mathematical argument for why this works below, but to bring this all together: the factor by which we scale a column’s variance coincides one-to-one with the importance/influence that column has on determining how ZIP codes are grouped. That is, if our importance scales are 1 and 2, then the first column will have importance 1/(1+2) and the other will have importance 2/(1+2); the second column will have 2x as much importance as the first!
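The variance-scaling claim in (2) is easy to see in a small simulation. The sketch below uses plain unweighted `stats::kmeans` just to isolate the variance effect; the data, the scale vector `a`, and the group construction are all mine:

```r
## Two standardized features; the true 2-group structure lives only in column 1
set.seed(1)
z <- matrix(rnorm(400), ncol = 2)
z[, 1] <- z[, 1] + rep(c(0, 3), each = 100)  # rows 1-100 vs 101-200 differ in col 1

## inflate column 2's variance by a factor of 25 (multiply the column by sqrt(25))
a <- c(1, 25)
z_scaled <- sweep(z, 2, sqrt(a), `*`)

fit_plain  <- kmeans(z,        centers = 2, nstart = 25)
fit_scaled <- kmeans(z_scaled, centers = 2, nstart = 25)

## with equal variances the clusters recover the column-1 split;
## after inflating column 2, the pure-noise column drives the assignment
table(fit_plain$cluster,  rep(c("low", "high"), each = 100))
table(fit_scaled$cluster, rep(c("low", "high"), each = 100))
```

On the unscaled data the two clusters line up almost perfectly with the true column-1 groups; after the inflation, the grouping is driven by the noise column instead. That is the lever we will pull deliberately, with scales chosen by coverage importance.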
Before addressing our 2 considerations, we have to prepare our data to get it to the point where it is ready for a traditional k-means clustering algorithm.
K-means assumes that each of its variables is isotropic (think: spherical). To move toward this, we “standardize” our data, scaling each column so that it has mean 0 and variance 1. (Standardizing alone doesn’t make a distribution normal, but our factors end up looking roughly standard normal anyway, as the plot below shows.)
This assumption is rooted in the fact that the algorithm finds clusters that are spherical in hyperspace, and is partially a consequence of the use of Euclidean distance at the heart of its objective function. To perform optimally, and produce the results you would expect from the algorithm, the input data has to have clusters that possess this quality. (Though later you will see that we take advantage of this construction to relax this assumption and make an even more powerful algorithm.)
Scale our data:
## STANDARDIZE EACH OF OUR VARIABLES
## Save these original means & variances to back out of this transformation after fitting k-means
original_scales_ <-
factors %>%
summarize(across(.cols = contains("pricing_factor"),
.fns = tibble::lst(mean, sd),
.names = "{fn}_{col}"
))
factors_scaled <-
factors %>%
mutate(across(.cols = contains("pricing_factor"),
.fns = ~(. - mean(.)) / sd(.)
))
## View
factors_scaled %>%
pivot_longer(cols = contains("pricing_factor"), names_to = "coverage", values_to = "pricing_factor") %>%
# filter(!Losses %in% c(0, NA)) %>%
mutate(coverage = gsub("_.*", "", coverage)) %>%
ggplot(aes(x = pricing_factor, fill = coverage)) + geom_density(alpha = .5) +
ggtitle("Distribution of factors should be roughly standard normal now", subtitle = "(You can see that they are)")
It’s sensible to consider that our weights for each ZIP code should be determined by some type of size metric: for example, the size of the population, or the number of policies written (roughly, exposure). Here, we can choose either exposure or claim count.
First, we want to ask: should we use an element (say, exposure) pertaining to a single coverage, or some element that aggregates all coverages within a ZIP code?
Consider exposure. Exposure for a single policy should be either 1 or 0 for each coverage. For BI and PD it will always be 1, since they are compulsory coverages. I’ll assume a constant distribution of coverage selections per policy within each ZIP code; that would drastically simplify our decision making, although when we dive into the data we will see this is actually not the case (to a big degree).
#### LET'S VIEW THIS ^ ANOMALY
## distribution of relative exposure within each ZIP code
factors_scaled %>%
rowwise() %>%
transmute(across(.cols = contains("Exposure"),
.fns = ~pmin(. / PD_Exposure_Years, 1))) %>%
## use pmin because ~3 ZIPs have COMP and COLL exposures higher than PD exposures.
## I assume this to be an error, and it doesn't affect our analysis anywhere else
## (except for the credibility constants, but there the COMP and COLL exposures were
## probably already high enough to get a credibility complement of 1 -- I could
## revisit this in section A but won't bother, for brevity; I'd have to make some
## sort of assumption and selection regardless)
ungroup() %>%
as.data.frame() %>%
pivot_longer(cols = everything(),
names_to = "coverage",
values_to = "exposure") %>%
mutate(coverage = gsub("_Exposure_Years", "", coverage)) %>%
ggplot(aes(x = exposure, y = coverage, fill = coverage)) +
geom_density_ridges(stat = "binline", bins = 100, scale = 1, draw_baseline = F, alpha = 0.7) +
labs(title = "Distribution of relative exposure within each ZIP code",
x = "Exposure Years", y = "Coverage") +
theme(legend.position = "none")
## seeing flat dist's right off the bat is pretty much the opposite of what we wanted to see
### We'd like to see them centered around the same point (except for BI and PD), but they definitely don't have the same centers; this confirms what we were saying above.
Given this, if we use the sum of all exposures as our weight (instead of just one coverage’s exposures), it exacerbates the violation of this assumption (wherever the distributions differ), and we don’t want to price for that, because credibility already accounts for cases where the distribution of exposures across coverages is lower than expected!
Diving into this: if we have, say, a typical distribution of exposures across coverages, and we cluster that observation with one whose distribution is “higher” than typical, then the higher-than-typical observation will be double counted and will pull the other observations even further toward it in the clustering (i.e., it will have more importance); our credibility weighting of factors in section A already does this! (A “higher” than typical distribution means optional coverages are selected more often relative to compulsory coverages.) On the other hand, if we cluster a typical observation with one whose distribution is lower than usual, the same thing happens: credibility is double counted, now in the opposite direction.
Thus, it is better to focus on a weighting element that pertains to one selected coverage to represent the weight for that entire ZIP code, rather than something that aggregates across all coverages, because aggregating somewhat double counts the credibility weighting used before. But it is still a drawback that a single coverage has to account for all of the coverages in the weighting, considering that the distributions of exposures are far from constant within each ZIP code.
This same problem with aggregating, and everything we’ve identified above, applies just as well if we replace exposure with claim count.
So let us now consider claim count. Unlike exposure, it isn’t just 0 or 1 for a policy. We can again assume that the coverages’ claim frequencies have the same distribution across ZIP codes, as we did for exposures, BUT we can see from the data that this is a far better assumption than assuming that exposures per policy follow the same distribution for each coverage. For this reason, we are comfortable using claim count instead of exposure for the weights! And this holds whether we aggregate claim count across coverages or just use one coverage’s claim count as the weight, though we already identified that aggregating is undesirable regardless of the measure (exposure or claim count).
So we will use claim count instead of exposure. But we want to standardize claim count in such a way that we can cap it (we don’t want the variance of the weights to be too high, or the clusters would be dominated by a few highly weighted ZIP codes); this is exactly what our credibility factors (not to be confused with pricing factors) do! An added bonus of our credibility factors is that they are capped at 1, which is reached when there are enough claims that the frequency distribution is deemed fully representative of the population; it makes sense that we would want to treat all fully credible ZIP codes the same.
## use `factors$PD_loss_cost_partial_credibility_UNCAPPED` as our weights
## it is (uncapped) credibility for loss cost, but is a function of claim counts
## uncapped credibility measure allows for full dimension across weights, rather than capping them and treating observations above the capping value as the same
(factors$PD_loss_cost_partial_credibility_UNCAPPED == 0) %>% sum(na.rm = T) ## and drop 7 observations
## [1] 7
hist(factors$PD_loss_cost_partial_credibility_UNCAPPED, main = "Histogram of weights")
Let’s define the scaled data as the standardized data multiplied by the square root of a chosen variance scale \(a_j\) for each feature (column) \(j\):
\[x_{ij}^\text{(scaled)} = \sqrt{a_j} \cdot x_{ij}^{\text{(original)}}\]
Then: \[\text{Var}(x_{ij}^\text{(scaled)}) = a_j \cdot \text{Var}(x_{ij}^{\text{(original)}})\]
The within-cluster sum of squared distances (WCSS) for weighted k-means clustering is
\[\text{WCSS} = \sum^K_{k=1} \sum_{i \in C_k} w_i \sum^d_{j=1} (x_{ij} - \mu_{kj})^2 \]
This is the same as computing a weighted sum of squared distances.
Now we plug in the fact that \(x_{ij} = \sqrt{a_j} \cdot z_{ij}\) where \(z_{ij}\) is the original standardized data. Then, factoring:
\[\text{WCSS}^{(w)} = \sum^K_{k=1} \sum_{i \in C_k} w_i \sum^d_{j=1} (\sqrt{a_j} \cdot z_{ij} - \sqrt{a_j} \cdot \bar{z}^{(w)}_{kj} )^2 = \sum^d_{j=1} a_j \Bigg[ \sum^K_{k=1} \sum_{i \in C_k} w_i \cdot (z_{ij} - \bar{z}^{(w)}_{kj} )^2 \Bigg]\]
Now define:
\[\text{WCSS}^{(w)}_j = \sum^K_{k=1} \sum_{i \in C_k} w_i \cdot (z_{ij} - \bar{z}^{(w)}_{kj})^2 \]
Where \(\text{WCSS}^{(w)}_j\) is the within-cluster sum of squares for feature \(j\) on the standardized scale.
This equation:
\[ \text{WCSS}^{(w)} = \sum^d_{j=1} a_j \cdot \text{WCSS}_j^{(w)} \]
shows that:

- The total objective is a linear combination of the variance scalars \(a_j\).
- The weights in this linear combination, \(\text{WCSS}^{(w)}_j\), depend only on the data and the clustering.
- Therefore, scaling feature \(j\)’s variance by \(a_j\) increases that feature’s contribution to the loss linearly, independently, and proportionally.
Therefore, the scalars we multiply the variances by are not arbitrary: they are direct, one-to-one controls over how much each dimension shapes the space, the clusters, and ultimately, the optimization landscape of k-means.
In application, if our variance scales that we apply to each column are some vector, \(\vec{a}\), then the importance for column \(i\) is expressed as \(\frac{a_i}{\sum{a}}\), or, equivalently, the importances are expressed by the ratios \(a_1 : a_2 : \space ... \space : a_j\).
Note that multiplying a random variable by a scalar \(c\) scales its variance by a factor of \(c^2\); so in practice we construct our variance scales \(a_j\), then multiply each column \(j\) by \(\sqrt{a_j}\) to scale its variance by \(a_j\).
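That fact is cheap to verify numerically (toy vector and scale of my own choosing):

```r
## Var(sqrt(a) * X) = a * Var(X), and this holds for the sample variance too
set.seed(42)
x <- rnorm(1000)
a <- 3.8
ratio <- var(sqrt(a) * x) / var(x)
all.equal(ratio, a)   # TRUE
```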
Now that we know we can exploit this nice quality of the k-means clustering algorithm, we must choose our scales. Because our pricing factors are constructed to represent the relative change in loss cost, we will use average loss costs by coverage as the scales. We will see that some coverages have loss costs that far outweigh others, due to disproportionate severities and frequencies. If we used the raw ratios of each coverage’s total loss cost as our scales, some small coverages (like UMPD) would be drowned out, so we make an “actuarial selection” and set a lower bound on these variance scales at 10% of the highest variance scale.
Implement:
## DEFINE VARIANCE SCALES
scales_ <-
factors %>%
## get each coverages total loss cost
summarize(across(.cols = contains("Losses"),
.fns = ~sum(., na.rm = TRUE) /
sum(get(gsub("_Losses", "_Exposure_Years", cur_column())), na.rm = T))) %>%
## divide each coverage loss cost by maximum loss cost
mutate(across(.cols = everything(),
.fns = ~. / max(across(everything())))) %>%
as.data.frame() %>%
## actuarial judgement: UMPD and some other small coverages are so small, don't let them get drowned out
## set minimum scale ratio to be 10% of largest coverage's scale
mutate(across(.cols = everything(),
.fns = ~pmax(., 0.1))) %>%
## rescale so that smallest variance scale factor is 1
mutate(across(.cols = everything(),
.fns = ~. / min(across(everything()))))
print(scales_)
## BI_Losses COLL_Losses COMP_Losses MP_Losses PD_Losses UMBI_Losses UMPD_Losses
## 1 3.799112 10 2.708612 1 4.376621 1 1
## ^ THESE ARE OUR SELECTED VARIANCE SCALES
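To make the floor-and-rescale selection concrete, here is the same logic applied to a hypothetical loss-cost vector (the numbers below are invented for illustration, not the CA DOI figures):

```r
## hypothetical per-coverage loss costs (invented for illustration)
loss_costs <- c(BI = 190, COLL = 500, COMP = 135, MP = 3, PD = 220, UMBI = 9, UMPD = 4)

## 1) ratio of each coverage's loss cost to the maximum
scales_toy <- loss_costs / max(loss_costs)
## 2) actuarial floor: no coverage's scale may fall below 10% of the largest
scales_toy <- pmax(scales_toy, 0.1)
## 3) rescale so the smallest variance scale equals 1
scales_toy <- scales_toy / min(scales_toy)
round(scales_toy, 3)
## COLL lands at 10; the floored coverages (MP, UMBI, UMPD) land at 1
```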
\[\space\]
## DEFINE CLUSTERING INPUT DATA (must be a matrix of our pricing factors)
input <-
factors_scaled %>%
select(contains("pricing_factor")) %>%
as.matrix()
input_cols <- colnames(input)
input %>% nrow()
## [1] 1819
## APPLY OUR VARIANCE SCALES (these scales represent what we want our columns' variances to be)
## we apply the sqrt of the scales to the variable, this mathematically makes the new variances equal to our `scales_`
input <-
t( t(input) * sqrt(c(as.matrix(scales_))) )
# %>% as.data.frame %>% summarize(across(.fns = var)) ## <- use this to check if variances were applied properly
# print(scales_)
## APPLY WEIGHTS (this is done when fitting)
## Some are zero (where ZIP codes had no claims)...
# which(factors_scaled$PD_loss_cost_partial_credibility_UNCAPPED == 0)
sum(factors_scaled$PD_loss_cost_partial_credibility_UNCAPPED == 0)
## [1] 7
Finally, we see that some observations (ZIP codes) have weight = 0. We will exclude them from the analysis; they simply won't be included in our pricing. For completeness, we could say, actuarially, that these ZIP codes are not offered in our product due to zero pricing information, and instruct Underwriters not to accept any policies in them.
## I iterated through multiple options for the number of clusters. New York requires that **at most** 150 territories be used; it does not require exactly 150.
# ## fit k-means based on different cluster amounts k
# for(i in 145:150){
# temp <-
# SWKM::kmeans.weight(x = input,
# weight = factors$PD_loss_cost_partial_credibility_UNCAPPED,
# K = i,
# nstart = 100)
# assign( paste0("clustering_", i),
# temp )
# }
#
# ## compare results
# for(i in 145:150){
# print(paste0("num territories = ", i))
# temp <- get( paste0("clustering_", i) )
# print(temp$wcss)
# print("------------")
# }
## best performers: 150 and 145.
## we can compare `wcss` across runs because it sums weighted distance (from observation to centroid) over all observations - the basis is unchanged by the number of clusters, $k$, chosen.
## ^ After doing this, it was determined that the best option is in fact 150 territories
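The same model-selection loop can be sketched with base R's (unweighted) `stats::kmeans`, standing in here for `SWKM::kmeans.weight` since that package may not be installed; `tot.withinss` plays the role of `wcss`, and is comparable across choices of k for the same reason:

```r
## hedged sketch: stats::kmeans stands in for SWKM::kmeans.weight, and
## tot.withinss stands in for `wcss` (summed point-to-centroid squared
## distance over all observations, comparable across choices of k)
set.seed(28)
toy <- matrix(rnorm(600), ncol = 3)   # hypothetical 200 x 3 clustering input
wcss_by_k <- sapply(2:6, function(k)
  stats::kmeans(toy, centers = k, nstart = 25)$tot.withinss)
names(wcss_by_k) <- paste0("k=", 2:6)
round(wcss_by_k, 1)   # wcss shrinks as k grows; we compare runs on this basis
```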
### RUN OUR K-MEANS CLUSTERING ALGORITHM
## RUN WEIGHTED K-MEANS CLUSTERING
## this algorithm automatically excludes observations that have weight = 0
set.seed(28)
clustering_150 <-
SWKM::kmeans.weight(x = input,
weight = factors$PD_loss_cost_partial_credibility_UNCAPPED,
K = 150,
nstart = 100)
## (run the exact same k-means clustering algorithm, only don't use any weights)
## save this for later...
set.seed(28)
clustering_150_no_weight <-
SWKM::kmeans.weight(x = input,
weight = rep(1, nrow(input)),
K = 150,
nstart = 100)
set.seed(NULL)
## PREPARE FOR PLOTTING+
## join our new factors onto their ZIP codes and also the geography objects we need to make these nice plots
factors_plot <-
ca_zips %>%
## right join because we want to keep `class` attribute "sf" for `ca_zips` in result,
## but we want to left join onto ZIP codes in `factors_scaled`
right_join(
factors_scaled,
by = c("ZCTA5" = "Zipcode")
) %>%
filter(PD_loss_cost_partial_credibility_UNCAPPED != 0) %>%
cbind(
data.frame(
cluster_id = clustering_150$cluster[,1]
)
) %>%
group_by(cluster_id) %>%
mutate(cluster_size = n()) %>%
ungroup() %>%
arrange(desc(cluster_size)) %>%
mutate(cluster_size_cum_desc = 1,
cluster_size_cum_desc = cumsum(cluster_size_cum_desc)) %>%
group_by(cluster_id) %>%
mutate(cluster_size_cum_desc = max(cluster_size_cum_desc)) %>%
ungroup() %>%
arrange(cluster_size_cum_desc) %>%
mutate(cluster_index_desc_sort = data.table::rleid(cluster_id))
factors_plot_temp <-
factors_plot %>%
filter(cluster_size_cum_desc <= 400) %>%
mutate(cluster_id = as.character(cluster_id))
helper_temp <-
factors_plot_temp %>%
as.data.frame() %>%
select(cluster_size, cluster_id) %>%
unique()
legend_labels <- setNames(paste0(as.character(helper_temp$cluster_id), ": ", as.character(helper_temp$cluster_size)),
nm = as.character(helper_temp$cluster_id)
)
factors_plot_temp %>%
ggplot() +
geom_sf(aes(fill = cluster_id),
color = "black", linewidth = 0.05) +
scale_fill_manual(
values = scales::hue_pal()(nrow(helper_temp)),
labels = legend_labels,
name = "cluster ID: cluster size"
) +
geom_sf(data = ca_state, fill = NA, color = "black", linewidth = 0.5) +
ggtitle("7 Biggest clusters")
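A side note on the `data.table::rleid()` call used in the wrangling above: it assigns a new integer id each time the value changes along the vector (a run-length id), which is why we `arrange()` first; ids are per run, not per distinct value:

```r
## rleid increments on every change of value, so sorting beforehand matters
data.table::rleid(c("a", "a", "b", "b", "b", "a"))
## [1] 1 1 2 2 2 3
```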
These are hard to view. If I plotted a map with all 150 clusters, there wouldn't be enough colors or even patterns to make any sense of the clustering. Note that it doesn't necessarily matter that the ZIP codes within each cluster are far apart from each other geographically; what matters is that the algorithm found the optimal clusters to reduce the error in the reassignment of their factors (and we can be confident k-means does this, given a sufficient number of starting points).
We could have tried a number of clusters far below 145, in hopes that it would produce a result with less error (WCSS), but that is somewhat unlikely. I leave that as something extra to test in future iterations of this project.
Because these are hard to view, we will focus on the error metric itself, and try to make sense of how good of a job our clustering algorithm did.
\[\space\]
## WRANGLE CLUSTERING RESULTS BACK INTO OUR USUAL FACTORS DATAFRAME FORMAT.
## conserve code, use a function
get_factors <- function(CLUSTERING_RUN_, ORIGINAL_FACTORS_, weights_ind, COLS_ = input_cols, SCALES_ = scales_, ORIGINAL_SCALES_ = original_scales_){
## get vector of each input row's corresponding assigned cluster id
CLUSTERING_RUN_$cluster %>%
as.data.frame() %>%
`colnames<-`("cluster_id") %>%
## join on the cluster centroids
left_join(
CLUSTERING_RUN_$centers %>% as.data.frame() %>% rowid_to_column(),
by = c("cluster_id" = "rowid")
) %>%
## join on ZIP codes
cbind(
ORIGINAL_FACTORS_ %>%
## correct for where algorithm didn't price because there were ZIP's with no exposure
## (when weight = 0)
filter(PD_loss_cost_partial_credibility_UNCAPPED != !!weights_ind) %>%
select(Zipcode)
) %>%
## format; add column names
relocate(Zipcode, .before = 1) %>%
`colnames<-`(c("Zipcode", "cluster_id", COLS_)) %>%
## undo scaling variances
tidyr::nest(
## nesting everything optimizes performance. Adding this allows nested data to be all rows
ids = !contains("pricing_factor"),
## so we add that ^ even tho we only wanna perform operations on this :
data = contains("pricing_factor")) %>%
mutate(data = purrr::map(data,
~as.data.frame(
t( t(as.matrix(.)) * (sqrt(c(as.matrix( SCALES_ ))))^-1 )
)
)) %>%
## undo standardizing
mutate(data = purrr::map(data,
~as.data.frame(
t( t(as.matrix(.)) *
unlist(
rename_with(select(ORIGINAL_SCALES_,
contains("sd_")),
.fn = ~gsub("sd_", "", .)))
)
)
)) %>%
mutate(data = purrr::map(data,
~as.data.frame(
sweep(
x = as.matrix(.),
MARGIN = 2,
STATS = unlist( rename_with(select( ORIGINAL_SCALES_ , contains("mean_")), .fn = ~gsub("mean_", "", .)) ),
FUN = `+`
)
)
)) %>%
## finish
tidyr::unnest(cols = everything())
}
## create factors data frame for our weighted k-means clustering algorithm run
clustered_factors <-
get_factors( clustering_150, factors, weights_ind = 0 )
## create factors data frame for our UN-WEIGHTED k-means clustering algorithm run (more on this one later...)
clustered_factors_no_weight_in_algo <-
get_factors( clustering_150_no_weight, factors, weights_ind = -1 )
print( clustered_factors )
## # A tibble: 1,812 × 9
## Zipcode cluster_id BI_pricing_factor COLL_pricing_factor COMP_pricing_factor
## <dbl> <int> <dbl> <dbl> <dbl>
## 1 90001 145 1.06 1.15 1.71
## 2 90002 95 1.55 1.33 1.46
## 3 90003 95 1.55 1.33 1.46
## 4 90004 51 1.83 1.57 1.06
## 5 90005 142 2.13 1.79 1.13
## 6 90006 142 2.13 1.79 1.13
## 7 90007 51 1.83 1.57 1.06
## 8 90008 109 1.54 1.36 1.21
## 9 90010 114 1.66 1.80 0.997
## 10 90011 95 1.55 1.33 1.46
## # ℹ 1,802 more rows
## # ℹ 4 more variables: MP_pricing_factor <dbl>, PD_pricing_factor <dbl>,
## # UMBI_pricing_factor <dbl>, UMPD_pricing_factor <dbl>
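The three "undo" steps inside `get_factors` (divide out \(\sqrt{a_j}\), multiply back the standard deviations, add back the means) can be sanity-checked as a round trip on a toy matrix; the names and numbers here are hypothetical, not the project's columns:

```r
set.seed(2)
m <- matrix(rnorm(40, mean = 5, sd = 2), ncol = 2)  # hypothetical factors
mu  <- colMeans(m)
sdv <- apply(m, 2, sd)
a   <- c(4, 9)                                      # hypothetical variance scales

## forward: standardize, then scale column variances to `a`
fwd <- t( (t(m) - mu) / sdv )
fwd <- t( t(fwd) * sqrt(a) )

## backward: the same steps in reverse, as in get_factors
bwd <- t( t(fwd) / sqrt(a) )      # undo variance scaling
bwd <- t( t(bwd) * sdv )          # undo standardizing (sd)
bwd <- sweep(bwd, 2, mu, `+`)     # undo standardizing (mean)

all.equal(bwd, m)                 # the round trip recovers the original
## [1] TRUE
```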
## JOIN NEW FACTORS ONTO OLD FACTORS
## bring in our original factors from the end of Section A
original_factors <-
factors %>%
filter(PD_loss_cost_partial_credibility_UNCAPPED != 0) %>%
## for the remainder of the analysis, we only need these columns
select(Zipcode, contains("factor"))
## join the factors before and after Section B together in these objects
factors_compare <-
clustered_factors %>%
rename_with(.cols = colnames(.)[-1],
.fn = ~paste0(., "_clustered")) %>%
left_join(
original_factors %>%
rename_with(.cols = colnames(.)[-1],
.fn = ~paste0(., "_original")),
by = "Zipcode"
)
factors_plot2 <-
right_join(
x = ca_zips,
y =
( select(clustered_factors, -cluster_id) - original_factors ) %>%
mutate(Zipcode = clustered_factors$Zipcode),
by = c("ZCTA5" = "Zipcode")
)
factors_compare %>%
mutate(
weight =
(factors %>%
filter(PD_loss_cost_partial_credibility_UNCAPPED != 0) %>%
pull(PD_loss_cost_partial_credibility_UNCAPPED))
) %>%
pivot_longer(cols = contains("factor")) %>%
mutate(name = gsub("pricing_factor_", "", name)) %>%
separate(name, into = c("coverage", "measure"), sep = "_") %>%
pivot_wider(names_from = measure, values_from = value) %>%
mutate(abs_diff = abs(clustered - original),
squared_diff = (clustered - original)^2) %>%
group_by(coverage) %>%
summarize(w_mae = round(weighted.mean(abs_diff, weight),4),
mae = round(mean(abs_diff),4),
diff_mae = w_mae - mae,
w_mse = round(weighted.mean(squared_diff, weight),4),
mse = round(mean(squared_diff),4),
diff_mse = w_mse - mse
)
## # A tibble: 7 × 7
## coverage w_mae mae diff_mae w_mse mse diff_mse
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 BI 0.0384 0.0354 0.00300 0.0028 0.0026 0.000200
## 2 COLL 0.0262 0.0244 0.0018 0.0014 0.0012 0.000200
## 3 COMP 0.0737 0.0713 0.0024 0.0105 0.01 0.000500
## 4 MP 0.103 0.0853 0.0181 0.0276 0.0213 0.0063
## 5 PD 0.0331 0.0325 0.000600 0.0023 0.0025 -0.000200
## 6 UMBI 0.049 0.0421 0.0069 0.0055 0.0043 0.0012
## 7 UMPD 0.04 0.0331 0.0069 0.0036 0.0027 0.0009
These results look good on the surface. Mean absolute error (MAE) for our large coverages (like COLL and PD) is very low; if a PD factor differs on average by only 0.033, that seems like a great start. The fact that errors are lower for our large coverages also shows that the re-scaling successfully imposed our importance scales on each coverage. I like to consult mean absolute error instead of mean squared error because it is easier to interpret and relate to our use case (i.e., this is how far, on average, our factors move after clustering), but do note that (weighted) mean squared error is what the algorithm uses in its objective function (for determining cluster assignments and centroids).
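For intuition, the two flavors of metric in that table reduce to two base R calls; a toy example with invented errors and weights (not our data):

```r
abs_diff <- c(0.02, 0.03, 0.10)   # hypothetical absolute factor changes
weight   <- c(5, 3, 1)            # hypothetical credibility weights

mae   <- mean(abs_diff)                   # raw MAE
w_mae <- weighted.mean(abs_diff, weight)  # credibility-weighted MAE
c(mae = mae, w_mae = w_mae)
## in this toy case w_mae < mae, because the high-weight observations moved least
```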
When we dive into these metrics, though, a glaring problem arises that we will have to investigate: the weighted mean error metrics are worse than the raw mean error metrics, suggesting that our weighted k-means was unable to give more importance to the ZIP codes with higher weight. That is, the observations with higher weights had their factors changed more than those with smaller weights, which is the opposite of what we expected to see, considering that the algorithm's objective function explicitly tries to prevent this.
Let’s try to diagnose this problem by inspecting our data:
p <-
factors_compare %>%
mutate(
weight =
(factors %>%
filter(PD_loss_cost_partial_credibility_UNCAPPED != 0) %>%
pull(PD_loss_cost_partial_credibility_UNCAPPED))
) %>%
pivot_longer(cols = contains("factor")) %>%
mutate(name = gsub("pricing_factor_", "", name)) %>%
separate(name, into = c("coverage", "measure"), sep = "_") %>%
pivot_wider(names_from = measure, values_from = value) %>%
mutate(abs_diff = abs(clustered - original),
squared_diff = (clustered - original)^2) %>% filter(coverage == "COLL") %>% arrange(weight) %>% mutate(ind = 1:n()) %>% ggplot(aes(x = log(weight), y = abs_diff, color = original)) + geom_point(alpha = .5) + geom_smooth(method = 'gam', formula = 'y ~ s(x, bs = "cs")') + scale_color_gradient2(low = "red", high = "blue", mid = "gray", midpoint = 1)
suppressWarnings(
ggExtra::ggMarginal(p, type = "histogram", margins = "x")
)
### ^ HIGHER WEIGHTS SHOULD HAVE LOWER ERRORS ON AVERAGE. WE SEE THE OPPOSITE OF THIS.
factors_compare %>%
mutate(
weight =
(factors %>%
filter(PD_loss_cost_partial_credibility_UNCAPPED != 0) %>%
pull(PD_loss_cost_partial_credibility_UNCAPPED))
) %>%
pivot_longer(cols = contains("factor")) %>%
mutate(name = gsub("pricing_factor_", "", name)) %>%
separate(name, into = c("coverage", "measure"), sep = "_") %>%
pivot_wider(names_from = measure, values_from = value) %>%
mutate(abs_diff = abs(clustered - original),
squared_diff = (clustered - original)^2) %>% filter(coverage == "COLL") %>%
summarize(cor(abs_diff, weight))
## # A tibble: 1 × 1
## `cor(abs_diff, weight)`
## <dbl>
## 1 0.0977
### ^ HIGHER WEIGHTS IN OUR DATASET ARE CORRELATED WITH GREATER DIFFERENCE BEFORE AND AFTER (THIS IS JUST A GUT CHECK TO ENSURE OUR ALGORITHM AND DATA TRANSFORMATIONS RAN PROPERLY)
factors_compare %>%
mutate(
weight =
(factors %>%
filter(PD_loss_cost_partial_credibility_UNCAPPED != 0) %>%
pull(PD_loss_cost_partial_credibility_UNCAPPED))
) %>%
pivot_longer(cols = contains("factor")) %>%
mutate(name = gsub("pricing_factor_", "", name)) %>%
separate(name, into = c("coverage", "measure"), sep = "_") %>%
pivot_wider(names_from = measure, values_from = value) %>%
mutate(abs_diff = abs(clustered - original),
squared_diff = (clustered - original)^2) %>% filter(coverage == "COLL") %>% arrange(weight) %>% mutate(ind = 1:n()) %>% ggplot(aes(x = original, y = clustered, color = weight)) + geom_point() + scale_color_gradient(low = "red", high = "blue") + geom_abline(slope = 1, intercept = 0) + ggtitle("COLL observations\nbelow black line means factor was decreased, above means increased")
## HARD TO REALLY DISCERN ANYTHING ELSE ABOUT OUR DATA FROM THIS PLOT ^ BUT WAS WORTH A TRY
These views don't really help us figure out why this problem is happening, but they do confirm that it is certainly present in our data.
Upon inspection of the code, we are certain the algorithm is applied properly. And with these data visualizations, we see how (but not why) the problem is occurring. So let us now compare these results to the same k-means algorithm run without weights (I ran and stored this above as the object `clustering_150_no_weight`), and look at how these same metrics perform for that run.
We will compare both clustering runs and choose the best result. Namely, we want to get to the bottom of why the weighted MSE/MAE values are worse than the raw MSE/MAE values.
## for an apples to apples comparison, we will make sure only the same observations are in both output datasets
factors_compare_no_weight_in_algo <-
clustered_factors_no_weight_in_algo %>%
rename_with(.cols = colnames(.)[-1],
.fn = ~paste0(., "_clustered")) %>%
left_join(
original_factors %>%
rename_with(.cols = colnames(.)[-1],
.fn = ~paste0(., "_original")),
by = "Zipcode"
) %>%
left_join(
factors %>%
select(Zipcode, PD_loss_cost_partial_credibility_UNCAPPED),
by = "Zipcode"
) %>%
## apply observation filter here
filter(PD_loss_cost_partial_credibility_UNCAPPED != 0)
## get our error metrics
factors_compare_no_weight_in_algo %>%
rename(weight = PD_loss_cost_partial_credibility_UNCAPPED) %>%
pivot_longer(cols = contains("factor")) %>%
mutate(name = gsub("pricing_factor_", "", name)) %>%
separate(name, into = c("coverage", "measure"), sep = "_") %>%
pivot_wider(names_from = measure, values_from = value) %>%
mutate(abs_diff = abs(clustered - original),
squared_diff = (clustered - original)^2) %>%
group_by(coverage) %>%
summarize(w_mae = round(weighted.mean(abs_diff, weight),4),
mae = round(mean(abs_diff),4),
diff_mae = w_mae - mae,
w_mse = round(weighted.mean(squared_diff, weight),4),
mse = round(mean(squared_diff),4),
diff_mse = w_mse - mse
)
## # A tibble: 7 × 7
## coverage w_mae mae diff_mae w_mse mse diff_mse
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 BI 0.0406 0.0348 0.0058 0.0031 0.0025 0.0006
## 2 COLL 0.0272 0.0237 0.0035 0.0016 0.0012 0.000400
## 3 COMP 0.0763 0.0674 0.00890 0.0107 0.0091 0.00160
## 4 MP 0.108 0.0831 0.0254 0.0324 0.023 0.0094
## 5 PD 0.0358 0.0319 0.0039 0.0025 0.0021 0.000400
## 6 UMBI 0.0518 0.0402 0.0116 0.0061 0.0043 0.0018
## 7 UMPD 0.04 0.0311 0.0089 0.0034 0.0023 0.0011
Okay, we see here that when we don't fit k-means clustering with weights, all of our mean error metrics (weighted and unweighted) get worse. This reinforces the initial inclination that weighted k-means is preferable to k-means without weights; our algorithm does do its job of giving more importance to observations with higher weight (in determining clusters and centroids). That still raises the question: why do our weighted mean error measures perform worse than the unweighted ones? Considering all of our output, and the formulation of the algorithm's fitting in general, it becomes apparent that this is due to the structure of our data itself: the observations with high weight happen to also be the ones that are hardest to fit (we measured a positive correlation between weight and factor change above), so their errors come out larger even though the weighting demonstrably shrinks them, as the weighted run beats the unweighted run on every weighted metric.
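To see that a positive weight-error correlation by itself pushes the weighted mean error above the raw mean, independent of any clustering algorithm, consider this toy simulation (hypothetical numbers, not our data):

```r
set.seed(3)
err <- runif(1000)          # hypothetical absolute errors
w   <- err + runif(1000)    # weights positively correlated with error
## the weighted mean up-weights exactly the observations that err the most
weighted.mean(err, w) > mean(err)
## [1] TRUE
```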
For these reasons, and upon consulting these error metric comparisons, we consider our weighted k-means clustering fit a success. Recall that, in our use case, it is a requirement that we reduce the dimension of our factors to at most 150 territories; some non-zero error is unavoidable no matter how we do that, and I assert that this is still the most favorable way to reduce the dimension of our factors.
factors_compare %>%
mutate(weight =
(
factors %>%
filter(PD_loss_cost_partial_credibility_UNCAPPED != 0) %>%
pull(PD_loss_cost_partial_credibility_UNCAPPED)
)) %>%
pivot_longer(cols = contains("factor")) %>%
mutate(name = gsub("pricing_factor_", "", name)) %>%
separate(name, into = c("coverage", "measure"), sep = "_") %>%
pivot_wider(names_from = measure, values_from = value) %>%
mutate(diff = clustered - original) %>%
group_by(coverage) %>%
summarize(
q_00 = quantile(diff, 0.0),
q_10 = quantile(diff, 0.1),
q_20 = quantile(diff, 0.2),
q_30 = quantile(diff, 0.3),
q_40 = quantile(diff, 0.4),
q_50 = quantile(diff, 0.5),
q_60 = quantile(diff, 0.6),
q_70 = quantile(diff, 0.7),
q_80 = quantile(diff, 0.8),
q_90 = quantile(diff, 0.9),
q_100 = quantile(diff, 1.0),
.groups = "drop"
) %>%
mutate(across(.cols = colnames(.)[-1],
.fns = ~round(.,3))) %>%
t() %>%
as.data.frame() %>%
rownames_to_column() %>%
`colnames<-`(.[1,]) %>%
rename(quantile = coverage) %>%
`[`(-1,) %>%
`rownames<-`(NULL)
## quantile BI COLL COMP MP PD UMBI UMPD
## 1 q_00 -0.343 -0.251 -0.662 -2.674 -0.929 -0.519 -0.384
## 2 q_10 -0.057 -0.037 -0.112 -0.130 -0.047 -0.056 -0.048
## 3 q_20 -0.036 -0.025 -0.067 -0.068 -0.030 -0.030 -0.026
## 4 q_30 -0.022 -0.016 -0.038 -0.036 -0.018 -0.016 -0.016
## 5 q_40 -0.011 -0.007 -0.016 -0.015 -0.010 -0.008 -0.010
## 6 q_50 -0.001 0.000 0.003 0.002 0.000 0.003 -0.002
## 7 q_60 0.007 0.006 0.024 0.021 0.010 0.012 0.007
## 8 q_70 0.018 0.014 0.045 0.042 0.020 0.024 0.016
## 9 q_80 0.032 0.023 0.071 0.072 0.032 0.037 0.028
## 10 q_90 0.050 0.035 0.104 0.122 0.049 0.063 0.050
## 11 q_100 0.398 0.332 1.015 0.969 0.270 0.402 0.313
factors_compare %>%
mutate(
weight =
(factors %>%
filter(PD_loss_cost_partial_credibility_UNCAPPED != 0) %>%
pull(PD_loss_cost_partial_credibility_UNCAPPED))
) %>%
pivot_longer(cols = contains("factor")) %>%
mutate(name = gsub("pricing_factor_", "", name)) %>%
separate(name, into = c("coverage", "measure"), sep = "_") %>%
pivot_wider(names_from = measure, values_from = value) %>%
mutate(diff = clustered - original) %>%
ggplot(aes(x = diff)) +
geom_histogram(bins = 300) +
facet_wrap(~coverage)
dir.create("frames", showWarnings = FALSE)
factors_long <-
factors_plot2 %>%
pivot_longer(cols = contains("pricing_factor"),
names_to = "coverage",
values_to = "pricing_factor")
# Define color limits across all frames
lims <- range(factors_long$pricing_factor, na.rm = TRUE)
for (i in gsub("_Losses",
"_pricing_factor",
names(rev(sort(unlist( scales_ )))))
) {
frame_data <- factors_long %>% filter(coverage == !!i)
# Plot main and zoomed maps with frame-specific data
p_base <- ggplot(frame_data) +
geom_sf(aes(fill = pricing_factor), color = "black", linewidth = 0.05) +
scale_fill_gradient2(low = "blue", mid = "gray90", high = "red",
midpoint = 0, limits = lims, na.value = "white") +
geom_sf(data = ca_state, fill = NA, color = "black", linewidth = 0.5) +
theme_void()
p_LA <-
p_base +
coord_sf(xlim = c(-119, -117.5), ylim = c(33.5, 34.25)) # +
p_SF <-
p_base +
coord_sf(xlim = c(-122.75, -121.85), ylim = c(37.3, 38.3)) +
theme(legend.position = "none") +
ggtitle(i)
composed <-
p_base + theme(legend.position = "none") +
inset_element(p_LA, left = 0.5, bottom = 0.55, right = 1, top = 1) +
inset_element(p_SF, left = 0.68, bottom = 0.37, right = 1, top = .67) +
plot_annotation(title = paste0("change in ", i))
rm(p_base) ; rm(p_LA) ; rm(p_SF)
ggsave(filename = glue("frames/frame_{i}.png"),
plot = composed, width = 9, height = 9, dpi = 150)
}
# Create animation
img <- magick::image_read(list.files("frames", full.names = TRUE))
animation <- magick::image_animate(img, fps = 1)
magick::image_write(animation, "pricing_animation.gif")
We see here that the changes in our coverages' factors are smallest for the coverages that had the biggest scales (COLL, PD, BI). This is desirable, and we can be confident the error we wanted to minimize was minimized properly, because that is exactly what k-means clustering does. Great!
Since this exercise is a toy project, I will conclude here. We could dive in further, but at this point in the analysis I am at least convinced that this is the most viable solution, that the results flow through sensibly, and that all assumptions about our model's inputs and outputs have been handled properly.
factors_plot %>%
ggplot() +
geom_sf(aes(fill = as.character(cluster_id)),
color = "black", linewidth = 0.05) +
geom_sf(data = ca_state, fill = NA, color = "black", linewidth = 0.5) +
ggtitle("All clusters")
\[\space\]
Finally, you may have been worried that the clusters seem randomly scattered geographically; there appears to be no pattern to the clusters at all when you consider the geography of California. The thing is, since we are clustering 1,812 ZIP codes into 150 groups, the (relatively) small variation and low dimension (7 variables) give us a use case where the optimal solution is so finely tuned that it need not align with geography. If we had to make only 10 clusters, we would see much more of a geographic trend. No wonder our error metrics were so low within each coverage.
\[\space\]
\[\space\] \[\space\] \[\space\] \[\space\] \[\space\]